AITopics

2605.06474

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Liu, Haolin, Snyder, Braham, Wei, Chen-Yu

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

arXiv.org Machine Learning, Feb-13-2026

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, we introduce a general framework that characterizes the intrinsic complexity of a given $Q^\star$ function class, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). This complexity recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of $Q^\star$ estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make several further contributions. First, by developing a novel second-order performance difference lemma, we obtain the first $\epsilon^{-2}$ sample complexity under partial coverage for soft $Q$-learning, improving the $\epsilon^{-4}$ bound of Uehara et al. (2023). Second, we remove Chen and Jiang's (2022) need for additional online interaction when the value gap of $Q^\star$ is unknown. Third, we give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL except for special cases. Finally, we provide the first analysis for CQL under $Q^\star$-realizability and Bellman completeness beyond the tabular case.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2602.12107

Country:

North America > United States > Virginia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing Systems, Feb-9-2026, 08:45:25 GMT

2a095b46705d7e6f81fc50270fe770c2-Supplemental-Conference.pdf

arxiv preprint arxiv, q-function, realizability, (11 more...)

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > Strength High (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing Systems, Feb-9-2026, 08:45:21 GMT

Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

We tackle this by introducing two novel value-based algorithms.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > Strength High (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

van der Laan, Lars, Kallus, Nathan

Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

arXiv.org Machine Learning, Jan-1-2026

Fitted Q-iteration (FQI) and its entropy-regularized variant, soft FQI, are central tools for value-based model-free offline reinforcement learning, but can behave poorly under function approximation and distribution shift. In the entropy-regularized setting, we show that the soft Bellman operator is locally contractive in the stationary norm of the soft-optimal policy, rather than in the behavior norm used by standard FQI. This geometric mismatch explains the instability of soft Q-iteration with function approximation in the absence of Bellman completeness. To restore contraction, we introduce stationary-reweighted soft FQI, which reweights each regression update using the stationary distribution of the current policy. We prove local linear convergence under function approximation with geometrically damped weight-estimation errors, assuming approximate realizability. Our analysis further suggests that global convergence may be recovered by gradually reducing the softmax temperature, and that this continuation approach can extend to the hardmax limit under a mild margin condition.
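For readers unfamiliar with the soft Bellman operator mentioned in this abstract, the following is a minimal tabular sketch, assuming a known transition tensor `P`, reward table `R`, discount `gamma`, and temperature `tau`. All function and variable names are hypothetical and are not taken from the paper. The sketch covers only the operator and the induced softmax policy. It does not implement the stationary reweighting or the function-approximation setting that the abstract analyzes.

```python
import numpy as np

def soft_value(Q, tau):
    """Soft state value V(s) = tau * log sum_a exp(Q(s, a) / tau), computed stably."""
    m = Q.max(axis=1)
    return m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))

def soft_bellman_backup(Q, P, R, gamma, tau):
    """One application of the soft Bellman optimality operator.

    Q, R have shape (S, A); P has shape (S, A, S) with P[s, a, s'] = Pr(s' | s, a).
    """
    return R + gamma * (P @ soft_value(Q, tau))

def softmax_policy(Q, tau):
    """Policy pi(a | s) proportional to exp(Q(s, a) / tau)."""
    z = (Q - Q.max(axis=1, keepdims=True)) / tau
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    # Toy random MDP (illustrative data only).
    rng = np.random.default_rng(0)
    S, A = 5, 3
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))

    Q = np.zeros((S, A))
    for _ in range(500):
        Q = soft_bellman_backup(Q, P, R, gamma=0.9, tau=0.5)
    residual = np.abs(soft_bellman_backup(Q, P, R, 0.9, 0.5) - Q).max()
    print("fixed-point residual:", residual)
    print("soft-optimal policy:\n", softmax_policy(Q, 0.5))
```

With an exact tabular model this iteration converges because the operator is a `gamma`-contraction in the sup norm. The abstract concerns the harder case where each backup is replaced by a regression onto a restricted function class, where that sup-norm argument no longer applies.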

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2512.23927

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

van der Laan, Lars, Kallus, Nathan

Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting

arXiv.org Machine Learning, Jan-1-2026

Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: that the hypothesis class is closed under the evaluation Bellman operator. This requirement is challenging because enlarging the hypothesis class can worsen completeness. We show that the need for this assumption stems from a fundamental norm mismatch: the Bellman operator is gamma-contractive under the stationary distribution of the target policy, whereas FQE minimizes Bellman error under the behavior distribution. We propose a simple fix: reweight each regression step using an estimate of the stationary density ratio, thereby aligning FQE with the norm in which the Bellman operator contracts. This enables strong evaluation guarantees in the absence of realizability or Bellman completeness, avoiding the geometric error blow-up of standard FQE in this setting while maintaining the practicality of regression-based evaluation.
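The reweighting idea in this abstract can be illustrated with a minimal sketch, assuming a linear function class and a vector of per-sample weights `w` that stands in for an estimated stationary density ratio. The function name `weighted_fqe`, the feature matrices `phi` and `phi_next`, and the ridge term are hypothetical choices made for illustration. Estimating the density ratio itself is not shown, and this is not the authors' implementation.

```python
import numpy as np

def weighted_fqe(phi, phi_next, r, w, gamma=0.9, n_iters=200, ridge=1e-8):
    """Fitted Q-evaluation with a weighted least-squares regression at each step.

    phi      : (n, d) features of the observed (s, a) pairs
    phi_next : (n, d) features of (s', a') with a' drawn from the target policy
    r        : (n,)   observed rewards
    w        : (n,)   nonnegative sample weights (assumed density-ratio estimates)
    Returns theta such that Q(s, a) is approximated by phi(s, a) @ theta.
    """
    n, d = phi.shape
    theta = np.zeros(d)
    # Weighted Gram matrix; it is the same at every iteration.
    gram = phi.T @ (w[:, None] * phi) + ridge * np.eye(d)
    for _ in range(n_iters):
        targets = r + gamma * (phi_next @ theta)       # bootstrapped regression targets
        theta = np.linalg.solve(gram, phi.T @ (w * targets))
    return theta

if __name__ == "__main__":
    # Two-state chain with one action: s0 -> s1 (reward 0), s1 -> s1 (reward 1).
    phi = np.eye(2)                          # one-hot features
    phi_next = np.array([[0.0, 1.0],
                         [0.0, 1.0]])
    r = np.array([0.0, 1.0])
    w = np.array([0.3, 1.7])                 # illustrative weights
    print(weighted_fqe(phi, phi_next, r, w, gamma=0.5))
```

Setting `w` to all ones gives standard FQE under the behavior distribution. In this tabular toy example the weights do not change the answer because the class is realizable and complete, so the choice of weighting matters only once the function class is misspecified, which is the regime the abstract addresses.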

init, machine learning, reinforcement learning, (14 more...)

2512.23805

Country: North America > United States (0.67)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)

Liu, Haolin, Wei, Chen-Yu, Zimmert, Julian

An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

arXiv.org Artificial Intelligence, Oct-13-2025

We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.

algorithm, artificial intelligence, machine learning, (17 more...)


2510.08882

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Neural Information Processing Systems, Oct-8-2025, 08:19:44 GMT

2a095b46705d7e6f81fc50270fe770c2-Supplemental-Conference.pdf

arxiv preprint arxiv, q-function, realizability, (11 more...)

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > Strength High (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.99)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)

Neural Information Processing Systems, Oct-8-2025, 08:19:41 GMT

2a095b46705d7e6f81fc50270fe770c2-Paper-Conference.pdf

arxiv preprint arxiv, q-function, realizability, (11 more...)

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > Strength High (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.99)

Jiang, Nan, Xie, Tengyang

Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees

arXiv.org Machine Learning, Oct-7-2025

This article introduces the theory of offline reinforcement learning in large state spaces, where good policies are learned from historical data without online interactions with the environment. Key concepts introduced include expressivity assumptions on function approximation (e.g., Bellman completeness vs. realizability) and data coverage (e.g., all-policy vs. single-policy coverage). A rich landscape of algorithms and results is described, depending on the assumptions one is willing to make and the sample and computational complexity guarantees one wishes to achieve. We also discuss open questions and connections to adjacent areas.

algorithm, assumption, international conference, (12 more...)

2510.04088

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)